dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55) #62
Open
dusterbloom wants to merge 3 commits into Luce-Org:main from
Conversation
…-decode step

Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree (target verify, ~3127 graph nodes) and build_draft_step (draft forward, ~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology, forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost). The reporter on Windows/RTX 4090 sees the "graph has different number of nodes" debug log fire every step and decode tok/s halving from 90 @ 16k context to 55 @ 32k context.

Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node graph topology, draft into the 186-node topology, and neither bounces. Existing prefill / target-verify call sites keep their `sg` references via a `StepGraph & sg = target_sg` alias; only the draft block (~10 calls) swaps `sg.X` for `draft_sg.X`. Daemon-mode reset and migrate-cache sites destroy both StepGraphs.

Verified with a one-line instrumentation patch on ggml_gallocr_alloc_graph (unconditional fprintf to stderr at each "needs_realloc returns true" site, removing the #ifndef NDEBUG gate that silences the upstream messages in Release builds). HE prompt 00 + ddtree-budget=22 + n_gen=256 over 26 spec-decode steps:

Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127), 14 cudaFree+cudaMalloc events.
After: 3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079, 0 -> 186), 0 cudaFree+cudaMalloc events during decode.

bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean):
main: 86.72 tok/s
this fix: 84.99 tok/s

Within bench noise on Linux/CUDA 12.6 because cudaMalloc is cheap on this stack; the saved per-step cost is small. The reporter's stack (Windows/CUDA 13/RTX 4090) has a slower stream allocator where the saved cost should translate into measurable tok/s recovery; that needs verification on the reporter's box.
…v-pad
# Conflicts:
#   dflash/test/test_dflash.cpp
05cb709 to 0ce6832
Contributor
@gtrak can you verify if this is solved?
javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request on May 10, 2026
This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:
1. Multi TargetCache slots
- CLI: --target-cache-slots=N (alias --cache-slots=N)
- prefix `SLOT <id>` routes commands to a specific slot
- DaemonSlotState + RAII ActiveDaemonSlot for safe switching
- LIST_TARGET_CACHE_SLOTS for introspection
- all slots share target/draft weights; only KV/SSM/scratch is
per-slot
- create_target_cache gains an `n_seqs` parameter so a single
cache can be allocated batched up front
2. Tagged stream protocol (opt-in)
- --stream-tagged emits frames `[-2, request_id, token]` instead
of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
- parser recognises `REQ <id>` / `REQUEST <id>` headers
- legacy bare-int32 streaming is unchanged when the flag is off
- this lets a client demux multiple concurrent requests over the
same stdout (a sketch of such a demux loop follows this list)
3. Native quantum scheduler
- dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
LIST_REQUESTS
- cursor-based fair round-robin between admitted requests
- non-blocking reader thread admits new requests during a drain
- PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
- CONTINUE / CONT resumes a slot without re-prefilling
- REQ <id> CANCEL invalidates a request and bumps the slot epoch
so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
refuse to overwrite a slot that is owned by an active scheduler
request
4. Fused batched target step (CUDA path)
- new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
SCHED_BATCH_DRAIN
- QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
n_seqs > 1
- target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
when batched and the chain forwards capture features per-seq
- batch_probe_compare_ok smoke shows mismatches=0 vs the
single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
req12/req13 with two batched steps each
- rollback for partially accepted draft tokens, multi-token verify
and parent-id propagation in the batched path are noted as
follow-ups; today the batched step accepts the cleanest case
and falls back to single-seq when needed
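
A hypothetical client-side demux loop for the tagged stream framing in (2), as a sketch only: the frame layout ([-2, request_id, token] as three int32 values) comes from the commit message above, but exactly where the -4/-1 sentinels appear is an assumption made here, and none of this code is part of the change itself.

```cpp
// Sketch of a --stream-tagged client demuxer.
// Assumption: the -4 (CONTINUE) and -1 (DONE) sentinels arrive in the
// token slot of a tagged frame; byte order matches the producer's native int32.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

static bool read_i32(std::FILE * f, int32_t & v) {
    return std::fread(&v, sizeof(v), 1, f) == 1;
}

int main() {
    std::map<int32_t, std::vector<int32_t>> tokens_by_request;
    int32_t first;
    while (read_i32(stdin, first)) {
        if (first != -2) {
            // legacy bare-int32 streaming (flag off): single implicit request
            tokens_by_request[0].push_back(first);
            continue;
        }
        int32_t request_id, token;
        if (!read_i32(stdin, request_id) || !read_i32(stdin, token)) break;
        if (token == -1) {            // DONE sentinel
            std::printf("req %d done, %zu tokens\n",
                        (int) request_id, tokens_by_request[request_id].size());
        } else if (token == -4) {     // CONTINUE sentinel: quantum boundary, more later
            continue;
        } else {
            tokens_by_request[request_id].push_back(token);
        }
    }
    return 0;
}
```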
Validation (single GPU, RTX 6000 Ada sm_89, Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV q4_0/q4_0):
- Two concurrent requests:
REQ 4 START SLOT 0 quantum=2
REQ 5 START SLOT 1 quantum=2
SCHED_DRAIN closes both clean.
slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
back to the DFlash single-seq path.
Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
default path is unchanged.
Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
--target-cache-slots.
Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.
Author: Javier Pazo <xabicasa@gmail.com>
Fixes #55.
Root cause
Every spec-decode iteration calls `build_target_step_tree` (target verify, ~3127 ggml graph nodes) at dflash/test/test_dflash.cpp:1703 and `build_draft_step` (draft forward, ~186 nodes) at dflash/test/test_dflash.cpp:1556 on the same `StepGraph sg`, sharing one `ggml_gallocr`. `ggml_gallocr_needs_realloc` compares `galloc->n_nodes` to `graph->n_nodes`, so every call sees a mismatch left over from the previous call's opposite topology — forcing `ggml_gallocr_reserve` to re-walk the entire graph (CPU cost) and often `cudaFree`+`cudaMalloc` the activation buffer (GPU driver cost).

Reporter on Windows/RTX 4090/CUDA 13 sees `ggml_gallocr_needs_realloc: graph has different number of nodes` log spam every step (the message is `#ifndef NDEBUG` gated, so Linux Release builds are silent but pay the same cost). Decode tok/s halves from 90 @ 16k context to 55 @ 32k context.

Fix
Split the shared `StepGraph sg` into `target_sg` and `draft_sg`, each with its own `ggml_gallocr`. Target verify settles into the 3127-node topology, draft into 186-node, neither bounces.

The diff is minimized via a `StepGraph & sg = target_sg;` alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swaps `sg.X` for `draft_sg.X`. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.
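
For review orientation, a minimal sketch of the shape of the change. The struct members, builder signatures, and alloc calls shown here are placeholders, not the actual dflash/test/test_dflash.cpp definitions:

```cpp
#include "ggml.h"
#include "ggml-alloc.h"

// Placeholder StepGraph: the real struct in test_dflash.cpp carries more state.
struct StepGraph {
    ggml_context * ctx    = nullptr;
    ggml_cgraph  * graph  = nullptr;
    ggml_gallocr_t galloc = nullptr;   // one allocator per StepGraph
};

// Sketch signatures only; the real builders take more parameters.
void build_target_step_tree(StepGraph & sg);
void build_draft_step(StepGraph & sg);

StepGraph target_sg;           // settles into the ~3127-node verify topology
StepGraph draft_sg;            // settles into the ~186-node draft topology
StepGraph & sg = target_sg;    // alias: prefill / target-verify call sites keep `sg.X`

void spec_decode_step() {
    // Draft forward: the ~10 renamed references go through draft_sg.
    build_draft_step(draft_sg);
    ggml_gallocr_alloc_graph(draft_sg.galloc, draft_sg.graph);   // always ~186 nodes, no realloc

    // Target verify: unchanged call sites still go through the alias.
    build_target_step_tree(sg);
    ggml_gallocr_alloc_graph(sg.galloc, sg.graph);               // always ~3127 nodes, no realloc
}
```

Each `ggml_gallocr` therefore only ever sees one node count after its first reserve, so `ggml_gallocr_needs_realloc` stays false during steady-state decode.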
Verification

Patched `ggml_gallocr_alloc_graph` to unconditionally `fprintf` to stderr at each "needs_realloc returns true" site (removing the `#ifndef NDEBUG` gate). Ran `test_dflash` on a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:

|                 | needs_realloc events over 26 steps | cudaFree+cudaMalloc events during decode |
|-----------------|------------------------------------|------------------------------------------|
| main (before)   | 56                                 | 14                                       |
| this PR (after) | 3                                  | 0                                        |

Reasons breakdown before fix:

- `n_nodes 186 → 3127` and `n_nodes 3127 → 186` (the alternation)
- `kv_pad` growth

After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).
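
For reference, the throwaway instrumentation was roughly the following (a sketch only; it paraphrases and abbreviates the upstream ggml-alloc.c structure and is not part of this PR):

```c
/* ggml-alloc.c — instrumentation only, not part of this PR.
   Sketch: the surrounding upstream code is abbreviated; add <stdio.h> if needed. */
bool ggml_gallocr_alloc_graph(ggml_gallocr_t galloc, struct ggml_cgraph * graph) {
    if (ggml_gallocr_needs_realloc(galloc, graph)) {
        /* was: #ifndef NDEBUG ... GGML_LOG_DEBUG(...) ... #endif
           now unconditional so Release builds show every realloc during decode */
        fprintf(stderr, "gallocr realloc: n_nodes %d -> %d\n",
                galloc->n_nodes, graph->n_nodes);
        if (!ggml_gallocr_reserve(galloc, graph)) {
            return false;
        }
    }
    /* ... unchanged: assign the reserved offsets to the graph's tensors ... */
    return true;
}
```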
Bench (RTX 3090 / Linux / CUDA 12.6)
`bench_he.py --n-gen 128 --ddtree-budget 22`, 3-run mean:

|          | decode tok/s |
|----------|--------------|
| main     | 86.72        |
| this fix | 84.99        |

Within bench noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step `cudaFree`+`cudaMalloc` cost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.

Test plan
- `bench_he.py` parity (within noise).

What this does NOT fix
- `needs_realloc` events from monotonic `kv_pad` growth at long context — these are rare boundary crossings (every ~256 tokens), not per-step. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.
- Codex also flagged the `StepGraph & sg = target_sg;` alias as a low-severity readability footgun — a follow-up `s/sg/target_sg/g` would clarify. Held to keep this PR minimal.